anonymized data [English]
InterPARES Definition
n. ~ Datasets that have had personally identifiable information (US) or personal data (EU) removed or obfuscated.
Citations
- Ohm 2010 (†712 p.1703-1704): Imagine a database packed with sensitive information about many people. Perhaps this database helps a hospital track its patients, a school its students, or a bank its customers. Now imagine that the office that maintains this database needs to place it in long-term storage or disclose it to a third party without compromising the privacy of the people tracked. To eliminate the privacy risk, the office will anonymize the data, consistent with contemporary, ubiquitous data-handling practices. First, it will delete personal identifiers like names and social security numbers. Second, it will modify other categories of information that act like identifiers in the particular context—the hospital will delete the names of next of kin, the school will excise student ID numbers, and the bank will obscure account numbers. What will remain is a best-of-both-worlds compromise: Analysts will still find the data useful, but unscrupulous marketers and malevolent identity thieves will find it impossible to identify the people tracked. Anonymization will calm regulators and keep critics at bay. Society will be able to turn its collective attention to other problems because technology will have solved this one. Anonymization ensures privacy. Unfortunately, this rosy conclusion vastly overstates the power of anonymization. Clever adversaries can often reidentify or deanonymize the people hidden in an anonymized database. This Article is the first to comprehensively incorporate an important new subspecialty of computer science, reidentification science, into legal scholarship. This research unearths a tension that shakes a foundational belief about data privacy: Data can be either useful or perfectly anonymous but never both. (†1629)
- Ohm 2010 (†712 p.1724): Many anonymization techniques would be perfect, if only the adversary knew nothing else about people in the world. In reality, of course, the world is awash in data about people, with new databases created every day. Adversaries combine anonymized data with outside information to pry out obscured identities. (†1631) (This linkage attack is illustrated in the second sketch after the citations.)
- Ohm 2010 (†712 p.1746): The accretion problem is this: Once an adversary has linked two anonymized databases together, he can add the newly linked data to his collection of outside information and use it to help unlock other anonymized databases. Success breeds further success. Narayanan and Shmatikov explain that “once any piece of data has been linked to a person’s real identity, any association between this data and a virtual identity breaks the anonymity of the latter.” This is why we should worry even about reidentification events that seem to expose only nonsensitive information, because they increase the linkability of data, and thereby expose people to potential future harm. (†1632)
- Ohm 2010 (†712 p.1752): Utility and privacy are, at bottom, two goals at war with one another. In order to be useful, anonymized data must be imperfectly anonymous. “[P]erfect privacy can be achieved by publishing nothing at all—but this has no utility; perfect utility can be obtained by publishing the data exactly as received from the respondents, but this offers no privacy.” No matter what the data administrator does to anonymize the data, an adversary with the right outside information can use the data’s residual utility to reveal other information. Thus, at least for useful databases, perfect anonymization is impossible. Theorists call this the impossibility result. There is always some piece of outside information that could be combined with anonymized data to reveal private information about an individual. (†1633)
- Zakerzadeh and Osborn 2013 (†711 p.424): Anonymization can be achieved through two techniques: Generalization and Suppression. In Generalization, the values of quasi-identifiers are replaced with more general values. For example, suppose country is one of the quasi-identifiers in a dataset. Generalization says that a value should be replaced by a more general value, so that the country would be replaced by the continent where the country is located. Suppression, on the other hand, means removing data from the dataset so that it is not released at all. Generalization and Suppression are complementary techniques. (†1628) (The first sketch after these citations illustrates both techniques.)
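The generalization and suppression operations that Zakerzadeh and Osborn describe are simple enough to illustrate in code. What follows is a minimal sketch, not drawn from the cited sources: the record layout, the field names, and the country-to-continent table are all invented for the example.

```python
# Toy generalization hierarchy for the quasi-identifier "country";
# a real hierarchy would come from the dataset's domain.
COUNTRY_TO_CONTINENT = {
    "France": "Europe",
    "Japan": "Asia",
    "Brazil": "South America",
}

def suppress(record, fields=("name", "ssn")):
    """Suppression: remove direct identifiers so they are not released at all."""
    return {k: v for k, v in record.items() if k not in fields}

def generalize(record):
    """Generalization: replace the quasi-identifier 'country' with a broader value."""
    out = dict(record)
    out["country"] = COUNTRY_TO_CONTINENT.get(out["country"], "Unknown")
    return out

record = {"name": "A. Smith", "ssn": "123-45-6789",
          "country": "France", "diagnosis": "flu"}
print(generalize(suppress(record)))
# {'country': 'Europe', 'diagnosis': 'flu'}
```

As the quotation notes, the two techniques are complementary: suppression withholds direct identifiers entirely, while generalization keeps quasi-identifiers usable for analysis at a coarser granularity.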
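Ohm's observation that adversaries "combine anonymized data with outside information" can likewise be made concrete. Below is a minimal sketch of such a linkage attack, assuming two invented tables that happen to share the quasi-identifiers zip and birth_year; every name and value is fabricated for illustration.

```python
# An "anonymized" release: direct identifiers deleted,
# quasi-identifiers retained for utility.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "diagnosis": "hypertension"},
    {"zip": "94110", "birth_year": 1990, "diagnosis": "asthma"},
]

# Outside information, e.g. a public roll that carries names.
outside_info = [
    {"name": "J. Doe", "zip": "02138", "birth_year": 1965},
    {"name": "R. Roe", "zip": "94110", "birth_year": 1990},
]

def link(anon_rows, known_rows, keys=("zip", "birth_year")):
    """Join the two tables on their shared quasi-identifiers."""
    matches = []
    for a in anon_rows:
        for k in known_rows:
            if all(a[q] == k[q] for q in keys):
                matches.append({**k, **a})
    return matches

for row in link(anonymized, outside_info):
    print(row)  # each "anonymous" record now carries a name
```

Each successful join also illustrates the accretion problem Ohm describes: the newly named rows become outside information for attacking the next anonymized release.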